431 Class 20

Thomas E. Love, Ph.D.

2023-11-09

Today’s Agenda

Some NHANES National Youth Fitness Survey (2012) data

  1. Exploration and Initial Data Summaries
    • Dealing with Missingness then Partitioning
  2. How might we transform our outcome?
  3. Building Candidate Prediction Models
  4. Checking Regression Assumptions
  5. Assessing the candidates, in training and test samples

Today’s Packages

library(janitor)       # clean_names(), tabyl(), etc. 
library(naniar)        # identifying/dealing with NA
library(patchwork)     # combining ggplot2 plots
library(broom)         # for tidy(), glance(), augment()
library(car)           # for boxCox
library(corrr)         # for correlation matrix
library(GGally)        # for scatterplot matrix
library(ggrepel)       # help with residual plots
library(gt)            # formatting tables
library(gtsummary)     # for tbl_regression() 
library(mosaic)        # favstats, df_stats & inspect
library(sessioninfo)   # for session_info()
library(simputation)   # for single imputation
library(tidyverse)

theme_set(theme_bw())
options(tidyverse.quiet = TRUE)
options(dplyr.summarise.inform = FALSE)

NHANES National Youth Fitness Survey (from 2012)

nnyfs_full <- read_rds("c20/data/nnyfs.Rds")
nnyfs1 <- nnyfs_full |> 
  rename(triceps = triceps_skinfold, age = age_child, 
         health = phys_health) |>
  mutate(asthma = fct_recode(factor(asthma_ever), 
                                  "Asthma" = "History of Asthma", 
                                  "Never" = "Never Had Asthma")) |>
  select(triceps, waist, age, asthma, health, SEQN)
dim(nnyfs1)
[1] 1518    6

The nnyfs1 data

# A tibble: 1,518 × 6
   triceps waist   age asthma health       SEQN
     <dbl> <dbl> <dbl> <fct>  <fct>       <dbl>
 1    NA    NA      15 Never  1_Excellent 71917
 2    19.9  71.9     8 Asthma 3_Good      71918
 3    15    79.4    14 Never  1_Excellent 71919
 4    20.6  96.4    15 Asthma 3_Good      71920
 5     8.6  46.8     3 Never  1_Excellent 71921
 6    22.8  90      12 Never  1_Excellent 71922
 7    20.5  72.3    12 Never  3_Good      71923
 8    12.9  56.1     8 Never  1_Excellent 71924
 9     6.9  54.5     7 Never  2_VeryGood  71925
10     8.8  59.7     8 Never  1_Excellent 71926
# ℹ 1,508 more rows

Which variables will we study today?

Variable Description
triceps Triceps skinfold (mm), our outcome
waist Waist circumference (cm)
age Age in years at screening (3 to 15)
asthma Ever told you have asthma?
health Self-reported Health (Excellent to Poor)
SEQN just a subject identifying code

Missingness?

miss_var_summary(nnyfs1)
# A tibble: 6 × 3
  variable n_miss pct_miss
  <chr>     <int>    <dbl>
1 triceps      21    1.38 
2 waist         6    0.395
3 age           0    0    
4 asthma        0    0    
5 health        0    0    
6 SEQN          0    0    
miss_case_table(nnyfs1)
# A tibble: 3 × 3
  n_miss_in_case n_cases pct_cases
           <int>   <int>     <dbl>
1              0    1496    98.6  
2              1      17     1.12 
3              2       5     0.329

How often are we missing waist when we have triceps?
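One way to answer this question (a sketch, using dplyr's count() on missingness indicators) is to cross-tabulate the two variables' NA status directly:

```r
# Sketch: cross-tabulate missingness in triceps and waist.
# From miss_var_summary and miss_case_table above, we'd expect
# 5 cases missing both, 16 missing triceps only, 1 missing waist only.
nnyfs1 |> 
  count(miss_triceps = is.na(triceps), miss_waist = is.na(waist))
```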

Little’s MCAR test

Could we assume MCAR for these missing values?

mcar_test(nnyfs1) ## from naniar
# A tibble: 1 × 4
  statistic    df p.value missing.patterns
      <dbl> <dbl>   <dbl>            <int>
1      19.5    14   0.148                4

Null hypothesis here is that the missingness is MCAR. Conclusions?

Dealing with Missingness

I don’t want to do single imputation on the outcome, and it seems we can still assume MCAR, so we’ll filter to complete cases on our outcome, and then impute the one remaining missing waist value.

nnyfs <- nnyfs1 |> 
  filter(complete.cases(triceps)) |> # don't want to impute outcome
  impute_rlm(waist ~ age + triceps + health + asthma)

miss_case_table(nnyfs)
# A tibble: 1 × 3
  n_miss_in_case n_cases pct_cases
           <int>   <int>     <dbl>
1              0    1497       100

nnyfs numerical summaries

Quantitative variables…

df_stats(~ triceps + waist + age, data = nnyfs) |> 
    gt() |>  fmt_number(decimals = 2, columns = c(mean, sd)) |>
    tab_options(table.font.size = 24)
response min Q1 median Q3 max mean sd n missing
triceps 4.0 9.1 12.4 18.0 38.8 14.36 6.76 1497 0
waist 42.5 55.6 64.8 76.6 132.3 67.62 14.94 1497 0
age 3.0 6.0 9.0 12.0 15.0 9.04 3.69 1497 0

nnyfs numerical summaries

Categorical Variables…

nnyfs |> tabyl(asthma, health) |> 
  adorn_totals(where = c("row", "col"))
 asthma 1_Excellent 2_VeryGood 3_Good 4_Fair 5_Poor Total
 Asthma          73         74     96     12      1   256
  Never         660        346    213     21      1  1241
  Total         733        420    309     33      2  1497

Collapsing to 3 health Levels

  • Let’s collapse together Good, Fair and Poor so we have a reasonable sample size (> 50 for today) in each cell.
nnyfs <- nnyfs |>
  mutate(health = fct_collapse(health,
                     Other = c("3_Good", "4_Fair", "5_Poor"),
                     Excellent = "1_Excellent",
                     VeryGood = "2_VeryGood"),
         health = fct_relevel(health, "Excellent", "VeryGood"))

nnyfs |> tabyl(asthma, health) |> 
  adorn_totals(where = c("row", "col"))
 asthma Excellent VeryGood Other Total
 Asthma        73       74   109   256
  Never       660      346   235  1241
  Total       733      420   344  1497

Resulting nnyfs set

inspect(nnyfs)  ## from mosaic package

categorical variables:  
    name  class levels    n missing
1 asthma factor      2 1497       0
2 health factor      3 1497       0
                                   distribution
1 Never (82.9%), Asthma (17.1%)                
2 Excellent (49%), VeryGood (28.1%) ...        

quantitative variables:  
     name   class     min      Q1  median      Q3     max         mean
1 triceps numeric     4.0     9.1    12.4    18.0    38.8    14.357248
2   waist numeric    42.5    55.6    64.8    76.6   132.3    67.620997
3     age numeric     3.0     6.0     9.0    12.0    15.0     9.044088
4    SEQN numeric 71918.0 72307.0 72700.0 73102.0 73492.0 72703.434870
          sd    n missing
1   6.758825 1497       0
2  14.943870 1497       0
3   3.685777 1497       0
4 457.310995 1497       0

Partitioning the nnyfs data

Let’s put 1000 nnyfs samples (66.8%) in the training set, leaving 497 for the test set.

set.seed(431020)
nnyfs_train <- slice_sample(nnyfs, n = 1000, replace = FALSE)
nnyfs_test <- anti_join(nnyfs, nnyfs_train, by = "SEQN")

nrow(nnyfs_train); nrow(nnyfs_test)
[1] 1000
[1] 497

DTDP: triceps skinfold in mm

p1 <- ggplot(nnyfs_train, aes(x = triceps)) +
  geom_histogram(bins = 20, fill = "steelblue", col = "white")

p2 <- ggplot(nnyfs_train, aes(sample = triceps)) + 
  geom_qq(col = "steelblue") + geom_qq_line(col = "tomato") +
  labs(y = "Observed triceps", x = "Normal (0,1) quantiles") + 
  theme(aspect.ratio = 1)

p3 <- ggplot(nnyfs_train, aes(x = "", y = triceps)) +
  geom_violin(fill = "steelblue", alpha = 0.1) + 
  geom_boxplot(fill = "steelblue", width = 0.3, notch = TRUE,
               outlier.color = "steelblue", outlier.size = 3) +
  labs(x = "") + coord_flip()

p1 + p2 - p3 +
  plot_layout(ncol = 1, heights = c(3, 2)) + 
  plot_annotation(title = "Triceps Skinfold (mm)",
         subtitle = str_glue("Model Development Sample: ", nrow(nnyfs_train), 
                           " children in NNYFS 2012"))

DTDP: triceps skinfold in mm

Consider transforming the outcome?

mod_0 <- lm(triceps ~ waist + age + asthma + health, data = nnyfs)
boxCox(mod_0)

Power Transformation suggestions

summary(powerTransform(mod_0))
bcPower Transformation to Normality 
   Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1    0.3167        0.33       0.2252       0.4083

Likelihood ratio test that transformation parameter is equal to 0
 (log transformation)
                           LRT df       pval
LR test, lambda = (0) 43.93977  1 3.3864e-11

Likelihood ratio test that no transformation is needed
                           LRT df       pval
LR test, lambda = (1) 230.6417  1 < 2.22e-16

I guess we could use the cube root transformation: \(\mbox{triceps}^{1/3}\) but I’ll go with \(\sqrt{\mbox{triceps}}\) as a starting point, because it’s somehow less daunting.
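One quick (informal) way to compare the two candidates numerically is to look at the sample skewness of each version of the outcome; both transformations should pull the right skew in the raw data toward zero. A sketch, with a simple moment-based skewness function (not from any of today's packages):

```r
# Sketch: compare skewness of raw, square root and cube root versions
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

nnyfs_train |> 
  summarize(raw = skew(triceps),
            square_root = skew(sqrt(triceps)),
            cube_root = skew(triceps^(1/3)))
```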

DTDP: \(\sqrt{\mbox{triceps}}\)

p1 <- ggplot(nnyfs_train, aes(x = sqrt(triceps))) +
  geom_histogram(bins = 20, fill = "slateblue", col = "white")

p2 <- ggplot(nnyfs_train, aes(sample = sqrt(triceps))) + 
  geom_qq(col = "slateblue") + geom_qq_line(col = "violetred") +
  labs(y = "Observed sqrt(triceps)", x = "Normal (0,1) quantiles") + 
  theme(aspect.ratio = 1)

p3 <- ggplot(nnyfs_train, aes(x = "", y = sqrt(triceps))) +
  geom_violin(fill = "slateblue", alpha = 0.1) + 
  geom_boxplot(fill = "slateblue", width = 0.3, notch = TRUE,
               outlier.color = "slateblue", outlier.size = 3) +
  labs(x = "") + coord_flip()

p1 + p2 - p3 +
  plot_layout(ncol = 1, heights = c(3, 2)) + 
  plot_annotation(title = "Square Root of Triceps Skinfold (mm)",
         subtitle = str_glue("Model Development Sample: ", nrow(nnyfs_train), 
                           " children in NNYFS 2012"))

DTDP: \(\sqrt{\mbox{triceps}}\)

DTDP: Is \(\mbox{triceps}^{1/3}\) much better?

p1 <- ggplot(nnyfs_train, aes(x = triceps^(1/3))) +
  geom_histogram(bins = 20, fill = "royalblue", col = "white")

p2 <- ggplot(nnyfs_train, aes(sample = triceps^(1/3))) + 
  geom_qq(col = "royalblue") + geom_qq_line(col = "magenta") +
  labs(y = "Observed cube root of triceps", x = "Normal (0,1) quantiles") + 
  theme(aspect.ratio = 1)

p3 <- ggplot(nnyfs_train, aes(x = "", y = triceps^(1/3))) +
  geom_violin(fill = "royalblue", alpha = 0.1) + 
  geom_boxplot(fill = "royalblue", width = 0.3, notch = TRUE,
               outlier.color = "royalblue", outlier.size = 3) +
  labs(x = "") + coord_flip()

p1 + p2 - p3 +
  plot_layout(ncol = 1, heights = c(3, 2)) + 
  plot_annotation(title = "Cube Root of Triceps Skinfold (mm)",
         subtitle = str_glue("Model Development Sample: ", nrow(nnyfs_train), 
                           " children in NNYFS 2012"))

DTDP: Is \(\mbox{triceps}^{1/3}\) much better?

Identify and Fit Candidate Models in training sample

Scatterplot Matrix

  • Again, I select the outcome (\(\sqrt{\mbox{triceps}}\)) last, so the bottom row shows the most important scatterplots.
temp <- nnyfs_train |> 
  mutate(sqrt_tri = sqrt(triceps)) |>
  select(waist, age, asthma, health, sqrt_tri)

ggpairs(temp, 
    title = "Scatterplots: Model Development Sample",
    lower = list(combo = wrap("facethist", bins = 20)))

Scatterplot Matrix

What about the categories?

What does our outcome look like by asthma category?

favstats(sqrt_tri ~ asthma, data = temp) |> gt() |>
    fmt_number(decimals = 2, columns = -c(n, missing)) |>
    tab_options(table.font.size = 24)
asthma min Q1 median Q3 max mean sd n missing
Asthma 2.28 3.16 3.74 4.49 6.20 3.85 0.92 163 0
Never 2.00 3.00 3.49 4.20 6.21 3.66 0.83 837 0

What about the categories?

How about the distribution by health category?

favstats(sqrt_tri ~ health, data = temp) |> gt() |>
    fmt_number(decimals = 2, columns = -c(n, missing)) |>
    tab_options(table.font.size = 24)
health min Q1 median Q3 max mean sd n missing
Excellent 2.00 2.95 3.42 4.07 6.03 3.57 0.79 500 0
VeryGood 2.30 3.04 3.61 4.34 6.20 3.76 0.87 274 0
Other 2.14 3.19 3.75 4.54 6.21 3.87 0.89 226 0

Correlation Matrix?

temp <- nnyfs_train |> 
  mutate(sqrt_tri = sqrt(triceps)) |>
  select(waist, age, asthma, health, sqrt_tri)

temp |> correlate() |>  
  gt() |> fmt_number(decimals = 3) |> 
  tab_options(table.font.size = 24)
term waist age sqrt_tri
waist NA 0.676 0.789
age 0.676 NA 0.353
sqrt_tri 0.789 0.353 NA

Any signs of meaningful collinearity here?

Checking Variance Inflation Factors

vif(lm(triceps ~ waist + age + asthma + health, 
       data = nnyfs))
           GVIF Df GVIF^(1/(2*Df))
waist  1.890109  1        1.374813
age    1.856713  1        1.362612
asthma 1.069957  1        1.034387
health 1.081987  2        1.019895

Any major concerns here? What are we looking for?
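For context (a common rule of thumb, not a hard cutoff): GVIF^(1/(2*Df)) is on the scale of the square root of an ordinary VIF, so values above about sqrt(5) = 2.24 (the analogue of VIF > 5) would start to worry us. A sketch of that check:

```r
# Sketch: flag any predictor whose GVIF^(1/(2*Df)) exceeds sqrt(5) ~ 2.24
v <- vif(lm(triceps ~ waist + age + asthma + health, data = nnyfs))
v[, "GVIF^(1/(2*Df))"] > sqrt(5)   # all FALSE for these predictors
```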

Use step to pick a candidate?

We’ll use backwards selection, starting with a model including all four predictors (the “kitchen sink” model).

step(lm(sqrt(triceps) ~ waist + age + asthma + health, 
        data = nnyfs_train))

Use step to pick a candidate?

Start:  AIC=-1483.18
sqrt(triceps) ~ waist + age + asthma + health

         Df Sum of Sq    RSS      AIC
- asthma  1      0.02 224.23 -1485.10
<none>                224.21 -1483.18
- health  2      2.92 227.13 -1474.22
- age     1     41.72 265.93 -1314.52
- waist   1    386.76 610.97  -482.71

Step:  AIC=-1485.1
sqrt(triceps) ~ waist + age + health

         Df Sum of Sq    RSS     AIC
<none>                224.23 -1485.1
- health  2      2.94 227.17 -1476.1
- age     1     41.74 265.97 -1316.4
- waist   1    387.24 611.46  -483.9

Call:
lm(formula = sqrt(triceps) ~ waist + age + health, data = nnyfs_train)

Coefficients:
   (Intercept)           waist             age  healthVeryGood     healthOther  
       0.46543         0.05709        -0.07535         0.12337         0.00586  

An Important Point

Stepwise regression lands on our modC, as it turns out.

  • There is a huge amount of evidence that variable selection causes severe problems in estimation and inference.
  • Stepwise regression is an especially bad choice.
  • Disappointingly, there really isn’t a great choice. The task itself just isn’t one we can do well in a uniform way across all of the different types of regression models we’ll build.

More on this in 432.

Which models will we fit?

Model waist age asthma health
modA Yes No No No
modB Yes Yes No No
modC Yes Yes No Yes
modD Yes Yes Yes Yes
  • Stepwise backwards selection via AIC suggested modC.
  • “Kitchen Sink” is modD
  • Today, each model is a subset of the next model.

Fitting modA through modD

Using the training sample…

modA <- lm(sqrt(triceps) ~ waist, 
           data = nnyfs_train)
modB <- lm(sqrt(triceps) ~ waist + age, 
           data = nnyfs_train)
modC <- lm(sqrt(triceps) ~ waist + age + health, 
           data = nnyfs_train)
modD <- lm(sqrt(triceps) ~ waist + age + asthma + health, 
           data = nnyfs_train)

Scatterplot with modA

ggplot(data = nnyfs_train, aes(x = waist, y = sqrt(triceps))) +
  geom_point(col = "slateblue") +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "Waist Circumference (cm)", 
       y = "Square Root of Triceps Skinfold (mm)", 
       title = "modA and our data", 
       subtitle = "nnyfs Training Sample (n = 1000)", 
       caption = "R-squared = 0.623") + 
  theme(aspect.ratio = 1)

Scatterplot with modA

modA: waist only

glance(modA) |> mutate(name = "modA") |>
  select(name, r.squared, adj.r.squared, sigma, AIC, BIC, nobs, df) |>
  gt() |> fmt_number(decimals = 3, columns = c(2:4)) |>
    fmt_number(decimals = 1, columns = c(5:6)) |>
    tab_options(table.font.size = 24)
name r.squared adj.r.squared sigma AIC BIC nobs df
modA 0.623 0.622 0.520 1,534.3 1,549.1 1000 1
tidy(modA, conf.int = TRUE, conf.level = 0.90) |>
  select(term, estimate, std.error, p.value, conf.low, conf.high) |>
  gt() |> fmt_number(decimals = 3) |> tab_options(table.font.size = 24)
term estimate std.error p.value conf.low conf.high
(Intercept) 0.672 0.076 0.000 0.547 0.797
waist 0.044 0.001 0.000 0.043 0.046

modB: waist and age

glance(modB) |> mutate(name = "modB") |>
  select(name, r.squared, adj.r.squared, sigma, AIC, BIC, nobs, df) |>
  gt() |> fmt_number(decimals = 3, columns = c(2:4)) |>
    fmt_number(decimals = 1, columns = c(5:6)) |>
    tab_options(table.font.size = 24)
name r.squared adj.r.squared sigma AIC BIC nobs df
modB 0.682 0.682 0.477 1,363.8 1,383.4 1000 2
tidy(modB, conf.int = TRUE, conf.level = 0.90) |>
  select(term, estimate, std.error, p.value, conf.low, conf.high) |>
  gt() |> fmt_number(decimals = 3) |> tab_options(table.font.size = 24)
term estimate std.error p.value conf.low conf.high
(Intercept) 0.504 0.071 0.000 0.387 0.620
waist 0.057 0.001 0.000 0.055 0.059
age −0.076 0.006 0.000 −0.085 −0.067

modC: waist, age and health

name r.squared adj.r.squared sigma AIC BIC nobs df
modC 0.687 0.685 0.475 1,354.8 1,384.2 1000 4
term estimate std.error p.value conf.low conf.high
(Intercept) 0.465 0.071 0.000 0.348 0.583
waist 0.057 0.001 0.000 0.055 0.059
age −0.075 0.006 0.000 −0.084 −0.066
healthVeryGood 0.123 0.036 0.001 0.065 0.182
healthOther 0.006 0.039 0.879 −0.058 0.069

modD: waist, age, asthma & health

name r.squared adj.r.squared sigma AIC BIC nobs df
modD 0.687 0.685 0.475 1,356.7 1,391.0 1000 5
term estimate std.error p.value conf.low conf.high
(Intercept) 0.477 0.083 0.000 0.341 0.614
waist 0.057 0.001 0.000 0.055 0.059
age −0.075 0.006 0.000 −0.085 −0.066
asthmaNever −0.012 0.042 0.779 −0.080 0.057
healthVeryGood 0.123 0.036 0.001 0.063 0.182
healthOther 0.004 0.039 0.921 −0.061 0.069

The Four Models

Four regression equations we’ve fit to the training sample…

  • modA: \(\sqrt{\mbox{triceps}}\) = 0.672 + 0.044 waist
  • modB: \(\sqrt{\mbox{triceps}}\) = 0.504 + 0.057 waist - 0.076 age
  • modC: \(\sqrt{\mbox{triceps}}\) = 0.465 + 0.057 waist - 0.075 age + 0.123 (health = Very Good) + 0.006 (health = Other)
  • modD: \(\sqrt{\mbox{triceps}}\) = 0.477 + 0.057 waist - 0.075 age - 0.012 (asthma = Never) + 0.123 (health = Very Good) + 0.004 (health = Other)
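To turn any of these equations into a prediction on the original scale, square the result. A sketch using modB for a hypothetical child (waist = 65 cm, age = 10; values chosen for illustration, not taken from our data):

```r
# Sketch: modB prediction for a hypothetical child, back on the mm scale
new_kid <- tibble(waist = 65, age = 10)
predict(modB, newdata = new_kid)^2   # predicted triceps skinfold, in mm
```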

Combining the glance() results

bind_rows(glance(modA) |> mutate(name = "modA"),
          glance(modB) |> mutate(name = "modB"),
          glance(modC) |> mutate(name = "modC"),
          glance(modD) |> mutate(name = "modD")) |>
  select(name, r.squared, adj.r.squared, sigma, AIC, BIC, nobs, df) |>
  gt() |> fmt_number(decimals = 5, columns = 2) |>
    fmt_number(decimals = 4, columns = c(3:4)) |>
    fmt_number(decimals = 0, columns = c(5:6)) |>
    tab_options(table.font.size = 24)
name r.squared adj.r.squared sigma AIC BIC nobs df
modA 0.62267 0.6223 0.5201 1,534 1,549 1000 1
modB 0.68247 0.6818 0.4773 1,364 1,383 1000 2
modC 0.68658 0.6853 0.4747 1,355 1,384 1000 4
modD 0.68661 0.6850 0.4749 1,357 1,391 1000 5

Regression Assumptions & Residual Plots via ggplot2

Checking Regression Assumptions

Four key assumptions we need to think about:

  1. Linearity
  2. Constant Variance (Homoscedasticity)
  3. Normality
  4. Independence

How do we assess 1, 2, and 3? Residual plots.

Augmenting our training data

aug_A <- augment(modA, data = nnyfs_train) |>
  mutate(sqrt_tri = sqrt(triceps)) # add in our model's outcome

aug_B <- augment(modB, data = nnyfs_train) |>
  mutate(sqrt_tri = sqrt(triceps)) # add in our model's outcome

aug_C <- augment(modC, data = nnyfs_train) |>
  mutate(sqrt_tri = sqrt(triceps)) # add in our model's outcome

aug_D <- augment(modD, data = nnyfs_train) |>
  mutate(sqrt_tri = sqrt(triceps)) # add in our model's outcome

Residuals vs. Fitted Values: ggplot2

ggplot(aug_A, aes(x = .fitted, y = .resid)) +
  geom_point() + 
  geom_point(data = aug_A |> 
               slice_max(abs(.resid), n = 5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_A |> 
               slice_max(abs(.resid), n = 5),
               aes(label = SEQN), col = "red") +
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Residuals vs. Fitted Values from modA",
       caption = "5 largest |residuals| highlighted in red.",
       x = "Fitted Value of sqrt(triceps)", y = "Residual") +
  theme(aspect.ratio = 1)

Residuals vs. Fitted Values: ggplot2

Standardized Residuals: ggplot2

p1 <- ggplot(aug_A, aes(sample = .std.resid)) +
  geom_qq() + 
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       y = "Standardized Residual from modA", 
       x = "Standard Normal Quantiles") +
  theme(aspect.ratio = 1)

p2 <- ggplot(aug_A, aes(y = .std.resid, x = "")) +
  geom_violin(fill = "ivory") +
  geom_boxplot(width = 0.3) +
  labs(title = "Box and Violin Plots",
       y = "Standardized Residual from modA",
       x = "modA")

p1 + p2 + 
  plot_layout(widths = c(2, 1)) +
  plot_annotation(
    title = "Normality of Standardized Residuals from modA",
    caption = str_glue("n = ", nrow(aug_A |> select(.std.resid)),
                     " residual values are plotted here."))

Standardized Residuals: ggplot2

Scale-Location Plot via ggplot2

ggplot(aug_A, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() + 
  geom_point(data = aug_A |> 
               slice_max(sqrt(abs(.std.resid)), n = 3),
             col = "red", size = 1) +
  geom_text_repel(data = aug_A |> 
               slice_max(sqrt(abs(.std.resid)), n = 3),
               aes(label = SEQN), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Scale-Location Plot for modA",
       caption = "3 largest |Standardized Residual| in red.",
       x = "Fitted Value of sqrt(triceps)", 
       y = "Square Root of |Standardized Residual|") +
  theme(aspect.ratio = 1)

Scale-Location Plot via ggplot2

Cook’s Distance Index Plot via ggplot2

aug_A_extra <- aug_A |> 
  mutate(obsnum = 1:nrow(aug_A |> select(.cooksd)))

ggplot(aug_A_extra, aes(x = obsnum, y = .cooksd)) + 
  geom_point() + 
  geom_text_repel(data = aug_A_extra |> 
               slice_max(.cooksd, n = 3),
               aes(label = SEQN)) +
  labs(x = "Observation Number",
       y = "Cook's Distance")

Cook’s Distance Index Plot via ggplot2

Residuals vs. Leverage Plot via ggplot2

ggplot(aug_A, aes(x = .hat, y = .std.resid)) +
  geom_point() + 
  geom_point(data = aug_A |> filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_A |> filter(.cooksd >= 0.5),
               aes(label = SEQN), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
  labs(title = "Residuals vs. Leverage from modA",
       caption = "Red points indicate Cook's d at least 0.5",
       x = "Leverage", y = "Standardized Residual") +
  theme(aspect.ratio = 1)

Residuals vs. Leverage Plot via ggplot2

Some Notes on the Residuals vs. Leverage Plot

In this ggplot() approach,

  • Points with Cook’s d >= 0.5 would be highlighted in red, if there were any.
  • Points to the right of the dashed line have more than 3 times the average leverage, and so are identified as highly leveraged by some analysts.
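If we wanted a count rather than a picture, one sketch (using base R's hatvalues() rather than the augmented data frame) would be:

```r
# Sketch: how many modA training points exceed 3x the mean leverage?
sum(hatvalues(modA) > 3 * mean(hatvalues(modA)))
```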

ggplot2 Residual Plots: modA

p1 <- ggplot(aug_A, aes(x = .fitted, y = .resid)) +
  geom_point() + 
  geom_point(data = aug_A |> 
               slice_max(abs(.resid), n = 3),
             col = "red", size = 2) +
  geom_text_repel(data = aug_A |> 
               slice_max(abs(.resid), n = 3),
               aes(label = SEQN), col = "red") +
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted Value of sqrt(triceps)", y = "Residual") 

p2 <- ggplot(aug_A, aes(sample = .std.resid)) +
  geom_qq() + 
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       y = "Standardized Residual", 
       x = "Standard Normal Quantiles") 

p3 <- ggplot(aug_A, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Value of sqrt(triceps)", 
       y = "|Std. Residual|^(1/2)") 

p4 <- ggplot(aug_A, aes(x = .hat, y = .std.resid)) +
  geom_point() + 
  geom_point(data = aug_A |> filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_A |> filter(.cooksd >= 0.5),
               aes(label = SEQN), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
  labs(title = "Residuals vs. Leverage",
       x = "Leverage", y = "Standardized Residual") 

(p1 + p2) / (p3 + p4) +
  plot_annotation(title = "Assessing Residuals for modA",
                  caption = "If applicable, Cook's d >= 0.5 shown in red in bottom right plot.")

ggplot2 Residual Plots: modA

ggplot2 Residual Plots: modB

p1 <- ggplot(aug_B, aes(x = .fitted, y = .resid)) +
  geom_point() + 
  geom_point(data = aug_B |> 
               slice_max(abs(.resid), n = 3),
             col = "red", size = 2) +
  geom_text_repel(data = aug_B |> 
               slice_max(abs(.resid), n = 3),
               aes(label = SEQN), col = "red") +
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted Value of sqrt(triceps)", y = "Residual") 

p2 <- ggplot(aug_B, aes(sample = .std.resid)) +
  geom_qq() + 
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       y = "Standardized Residual", 
       x = "Standard Normal Quantiles") 

p3 <- ggplot(aug_B, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Value of sqrt(triceps)", 
       y = "|Std. Residual|^(1/2)") 

p4 <- ggplot(aug_B, aes(x = .hat, y = .std.resid)) +
  geom_point() + 
  geom_point(data = aug_B |> filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_B |> filter(.cooksd >= 0.5),
               aes(label = SEQN), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
  labs(title = "Residuals vs. Leverage",
       x = "Leverage", y = "Standardized Residual") 

(p1 + p2) / (p3 + p4) +
  plot_annotation(title = "Assessing Residuals for modB",
                  caption = "If applicable, Cook's d >= 0.5 shown in red in bottom right plot.")

ggplot2 Residual Plots: modB

ggplot2 Residual Plots: modC

p1 <- ggplot(aug_C, aes(x = .fitted, y = .resid)) +
  geom_point() + 
  geom_point(data = aug_C |> 
               slice_max(abs(.resid), n = 3),
             col = "red", size = 2) +
  geom_text_repel(data = aug_C |> 
               slice_max(abs(.resid), n = 3),
               aes(label = SEQN), col = "red") +
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted Value of sqrt(triceps)", y = "Residual") 

p2 <- ggplot(aug_C, aes(sample = .std.resid)) +
  geom_qq() + 
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       y = "Standardized Residual", 
       x = "Standard Normal Quantiles") 

p3 <- ggplot(aug_C, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Value of sqrt(triceps)", 
       y = "|Std. Residual|^(1/2)") 

p4 <- ggplot(aug_C, aes(x = .hat, y = .std.resid)) +
  geom_point() + 
  geom_point(data = aug_C |> filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_C |> filter(.cooksd >= 0.5),
               aes(label = SEQN), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
  labs(title = "Residuals vs. Leverage",
       x = "Leverage", y = "Standardized Residual") 

(p1 + p2) / (p3 + p4) +
  plot_annotation(title = "Assessing Residuals for modC",
                  caption = "If applicable, Cook's d >= 0.5 shown in red in bottom right plot.")

ggplot2 Residual Plots: modC

ggplot2 Residual Plots: modD

p1 <- ggplot(aug_D, aes(x = .fitted, y = .resid)) +
  geom_point() + 
  geom_point(data = aug_D |> 
               slice_max(abs(.resid), n = 3),
             col = "red", size = 2) +
  geom_text_repel(data = aug_D |> 
               slice_max(abs(.resid), n = 3),
               aes(label = SEQN), col = "red") +
  geom_abline(intercept = 0, slope = 0, lty = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted Value of sqrt(triceps)", y = "Residual") 

p2 <- ggplot(aug_D, aes(sample = .std.resid)) +
  geom_qq() + 
  geom_qq_line(col = "red") +
  labs(title = "Normal Q-Q plot",
       y = "Standardized Residual", 
       x = "Standard Normal Quantiles") 

p3 <- ggplot(aug_D, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Value of sqrt(triceps)", 
       y = "|Std. Residual|^(1/2)") 

p4 <- ggplot(aug_D, aes(x = .hat, y = .std.resid)) +
  geom_point() + 
  geom_point(data = aug_D |> filter(.cooksd >= 0.5),
             col = "red", size = 2) +
  geom_text_repel(data = aug_D |> filter(.cooksd >= 0.5),
               aes(label = SEQN), col = "red") +
  geom_smooth(method = "loess", formula = y ~ x, se = F) +
  geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
  labs(title = "Residuals vs. Leverage",
       x = "Leverage", y = "Standardized Residual") 

(p1 + p2) / (p3 + p4) +
  plot_annotation(title = "Assessing Residuals for modD",
                  caption = "If applicable, Cook's d >= 0.5 shown in red in bottom right plot.")

ggplot2 Residual Plots: modD

Assessing the Candidates (in the test sample)

modA prediction errors in test sample

To back out of the square root transformation, we take the square. So we will square the fitted values provided by augment() and then calculate residuals on the original (mm) scale, as follows…

test_mA <- augment(modA, newdata = nnyfs_test) |>
  mutate(name = "modA", fit_triceps = .fitted^2,
         res_triceps = triceps - fit_triceps) 

What does test_mA now include?

test_mA |>
  select(triceps, fit_triceps, res_triceps, SEQN, name, everything()) |> 
  head() |> gt() |> fmt_number(decimals = 2, columns = c(2:3, 10:11)) |>
  tab_options(table.font.size = 24)
triceps fit_triceps res_triceps SEQN name waist age asthma health .fitted .resid
19.9 14.98 4.92 71918 modA 71.9 8 Asthma Other 3.87 0.59
8.6 7.59 1.01 71921 modA 46.8 3 Never Excellent 2.75 0.18
20.5 15.12 5.38 71923 modA 72.3 12 Never Other 3.89 0.64
12.9 10.04 2.86 71924 modA 56.1 8 Never Excellent 3.17 0.42
23.4 22.12 1.28 71930 modA 90.6 12 Never Other 4.70 0.13
18.8 15.30 3.50 71934 modA 72.8 10 Asthma VeryGood 3.91 0.42

Test-sample predictions/errors: models B-D

test_mB <- augment(modB, newdata = nnyfs_test) |>
  mutate(name = "modB", fit_triceps = .fitted^2,
         res_triceps = triceps - fit_triceps) 

test_mC <- augment(modC, newdata = nnyfs_test) |>
  mutate(name = "modC", fit_triceps = .fitted^2,
         res_triceps = triceps - fit_triceps) 

test_mD <- augment(modD, newdata = nnyfs_test) |>
  mutate(name = "modD", fit_triceps = .fitted^2,
         res_triceps = triceps - fit_triceps) 

Combined test results: all four models

test_comp <- bind_rows(test_mA, test_mB, test_mC, test_mD) |>
  arrange(SEQN, name)

test_comp |> select(triceps, fit_triceps, res_triceps, SEQN, 
                    name, everything()) |> 
  head(5) |> gt() |> fmt_number(decimals = 2, columns = c(2:3, 10:11)) |>
  tab_options(table.font.size = 24)
triceps fit_triceps res_triceps SEQN name waist age asthma health .fitted .resid
19.9 14.98 4.92 71918 modA 71.9 8 Asthma Other 3.87 0.59
19.9 16.03 3.87 71918 modB 71.9 8 Asthma Other 4.00 0.46
19.9 15.79 4.11 71918 modC 71.9 8 Asthma Other 3.97 0.49
19.9 15.85 4.05 71918 modD 71.9 8 Asthma Other 3.98 0.48
8.6 7.59 1.01 71921 modA 46.8 3 Never Excellent 2.75 0.18

Comparing Model Predictions

test_comp |>
  group_by(name) |>
  summarize(n = n(),
            MAPE = mean(abs(res_triceps)), 
            RMSPE = sqrt(mean(res_triceps^2)),
            max_error = max(abs(res_triceps)),
            median_APE = median(abs(res_triceps)),
            valid_R2 = cor(triceps, fit_triceps)^2) |>
  gt() |> fmt_number(decimals = 4, columns = -"n") |>
    tab_options(table.font.size = 24)
name n MAPE RMSPE max_error median_APE valid_R2
modA 497 3.0713 4.0411 16.7973 2.3048 0.6467
modB 497 2.8529 3.8502 20.8783 2.1706 0.6809
modC 497 2.8641 3.8713 20.4745 2.0880 0.6773
modD 497 2.8646 3.8705 20.4190 2.1044 0.6774

Overall Conclusions, 1

We fit modA, modB, modC and modD to predict \(\sqrt{triceps}\) after deleting 21 subjects with missing triceps and (singly) imputing one missing waist value, yielding 1497 subjects.

  • Training Sample: differences between models not large
    • modC has best adjusted \(R^2\), \(\hat{\sigma}\) and AIC
    • modB has best BIC, modD (of course) has best \(R^2\)
  • Residual plots: Linearity and Normality look OK.
    • Maybe an increasing trend in scale-location plot.
  • No signs of meaningful collinearity in training sample.

Overall Conclusions, 2

  • Test Sample: differences also not especially large
    • modB has the smallest MAPE and RMSPE, and the largest validated \(R^2\)
    • modA has the smallest maximum error
    • modC has the smallest median APE

I’d probably go with modB (emphasize predictive performance) but there’s a case for some other choices.

  • Training sample: modB (waist + age) shows \(R^2\) = 0.682
  • Test sample: modB validated \(R^2\) = 0.681, so essentially no drop-off from training to test

summary(modB): using 1497 subjects

modB_1497 <- lm(sqrt(triceps) ~ waist + age, data = nnyfs) 
modB_1497 |> summary()

Call:
lm(formula = sqrt(triceps) ~ waist + age, data = nnyfs)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.74788 -0.32237 -0.00903  0.30179  1.61454 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.519068   0.057782   8.983   <2e-16 ***
waist        0.056724   0.001114  50.929   <2e-16 ***
age         -0.073006   0.004516 -16.167   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.475 on 1494 degrees of freedom
Multiple R-squared:  0.6819,    Adjusted R-squared:  0.6815 
F-statistic:  1601 on 2 and 1494 DF,  p-value: < 2.2e-16
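To use modB for a new child, compute the linear predictor on the square-root scale and square it. A sketch using the coefficients printed above, with hypothetical waist and age values:

```r
# Coefficients from the summary above; waist and age values are hypothetical
b0 <- 0.519068; b_waist <- 0.056724; b_age <- -0.073006

new_waist <- 65   # waist circumference in cm (hypothetical)
new_age   <- 10   # age in years (hypothetical)

pred_sqrt    <- b0 + b_waist * new_waist + b_age * new_age
pred_triceps <- pred_sqrt^2   # back to the original (mm) scale
round(pred_triceps, 1)        # 12.1
```

In practice `predict(modB_1497, newdata = ...)` would give the same linear predictor; the point is the final squaring step.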

tbl_regression() from gtsummary

modB_1497 |> 
  tbl_regression(conf.level = 0.90, intercept = TRUE) |>
  add_vif() |>
  as_gt() |> 
  tab_options(table.font.size = 24)
Characteristic Beta 90% CI1 p-value VIF1
(Intercept) 0.52 0.42, 0.61 <0.001
waist 0.06 0.05, 0.06 <0.001 1.8
age -0.07 -0.08, -0.07 <0.001 1.8
1 CI = Confidence Interval, VIF = Variance Inflation Factor
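With exactly two predictors, each VIF equals \(1/(1 - r^2)\), where \(r\) is the correlation between the predictors, so the VIF of 1.8 shown above implies waist and age are correlated at roughly \(|r| \approx 0.67\). A quick back-calculation:

```r
# Back-calculating the implied waist-age correlation from VIF = 1.8
vif <- 1.8
r_squared <- 1 - 1 / vif    # since VIF = 1 / (1 - r^2)
round(sqrt(r_squared), 2)   # 0.67
```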

Training Sample: modD

Using just the training sample (n = 1000), we have:

modD |> 
  tbl_regression(conf.level = 0.90, 
                 intercept = TRUE, 
                 add_estimate_to_reference_rows = TRUE) |>
  add_global_p() |> 
  add_vif() |> 
  bold_labels() |>
  italicize_levels() |>
  as_gt() |> 
  fmt_number(decimals = 3) |> 
  tab_options(table.font.size = 20)

Training Sample: modD

Characteristic   Beta     90% CI1        p-value   GVIF1   Adjusted GVIF2,1
(Intercept)      0.477    0.34, 0.61     0.000
waist            0.057    0.05, 0.06     0.000     1.896   1.377
age             −0.075   -0.08, -0.07    0.000     1.860   1.364
asthma                                   0.779     1.047   1.023
    Asthma       0.000
    Never       −0.012   -0.08, 0.06
health                                   0.002     1.066   1.016
    Excellent    0.000
    VeryGood     0.123    0.06, 0.18
    Other        0.004   -0.06, 0.07
1 CI = Confidence Interval, GVIF = Generalized Variance Inflation Factor
2 GVIF^[1/(2*df)]
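The adjusted GVIF reported by `add_vif()` is \(GVIF^{1/(2 \cdot df)}\), which puts terms with different degrees of freedom on a comparable scale (for a 1-df term it is just \(\sqrt{GVIF}\)). Checking the values in the table:

```r
# Adjusted GVIF = GVIF^(1/(2*df)); df = 1 for waist, df = 2 for health (3 levels)
gvif <- c(waist = 1.896, health = 1.066)
df   <- c(waist = 1,     health = 2)
round(gvif^(1 / (2 * df)), 3)   # 1.377 and 1.016, matching the table above
```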

Session Information

session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16 ucrt)
 os       Windows 11 x64 (build 22621)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/New_York
 date     2023-10-29
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package       * version  date (UTC) lib source
 abind           1.4-5    2016-07-21 [1] CRAN (R 4.3.0)
 backports       1.4.1    2021-12-13 [1] CRAN (R 4.3.0)
 broom         * 1.0.5    2023-06-09 [1] CRAN (R 4.3.1)
 broom.helpers   1.14.0   2023-08-07 [1] CRAN (R 4.3.1)
 car           * 3.1-2    2023-03-30 [1] CRAN (R 4.3.1)
 carData       * 3.0-5    2022-01-06 [1] CRAN (R 4.3.1)
 cli             3.6.1    2023-03-23 [1] CRAN (R 4.3.1)
 colorspace      2.1-0    2023-01-23 [1] CRAN (R 4.3.1)
 commonmark      1.9.0    2023-03-17 [1] CRAN (R 4.3.1)
 corrr         * 0.4.4    2022-08-16 [1] CRAN (R 4.3.1)
 digest          0.6.33   2023-07-07 [1] CRAN (R 4.3.1)
 dplyr         * 1.1.3    2023-09-03 [1] CRAN (R 4.3.1)
 ellipsis        0.3.2    2021-04-29 [1] CRAN (R 4.3.1)
 evaluate        0.22     2023-09-29 [1] CRAN (R 4.3.1)
 fansi           1.0.5    2023-10-08 [1] CRAN (R 4.3.1)
 farver          2.1.1    2022-07-06 [1] CRAN (R 4.3.1)
 fastmap         1.1.1    2023-02-24 [1] CRAN (R 4.3.1)
 forcats       * 1.0.0    2023-01-29 [1] CRAN (R 4.3.1)
 generics        0.1.3    2022-07-05 [1] CRAN (R 4.3.1)
 GGally        * 2.1.2    2021-06-21 [1] CRAN (R 4.3.1)
 ggforce         0.4.1    2022-10-04 [1] CRAN (R 4.3.1)
 ggformula     * 0.10.4   2023-04-11 [1] CRAN (R 4.3.1)
 ggplot2       * 3.4.4    2023-10-12 [1] CRAN (R 4.3.1)
 ggrepel       * 0.9.4    2023-10-13 [1] CRAN (R 4.3.1)
 ggridges        0.5.4    2022-09-26 [1] CRAN (R 4.3.1)
 ggstance        0.3.6    2022-11-16 [1] CRAN (R 4.3.1)
 glue            1.6.2    2022-02-24 [1] CRAN (R 4.3.1)
 gower           1.0.1    2022-12-22 [1] CRAN (R 4.3.0)
 gt            * 0.10.0   2023-10-07 [1] CRAN (R 4.3.1)
 gtable          0.3.4    2023-08-21 [1] CRAN (R 4.3.1)
 gtsummary     * 1.7.2    2023-07-15 [1] CRAN (R 4.3.1)
 haven           2.5.3    2023-06-30 [1] CRAN (R 4.3.1)
 hms             1.1.3    2023-03-21 [1] CRAN (R 4.3.1)
 htmltools       0.5.6.1  2023-10-06 [1] CRAN (R 4.3.1)
 janitor       * 2.2.0    2023-02-02 [1] CRAN (R 4.3.1)
 jsonlite        1.8.7    2023-06-29 [1] CRAN (R 4.3.1)
 knitr           1.44     2023-09-11 [1] CRAN (R 4.3.1)
 labeling        0.4.3    2023-08-29 [1] CRAN (R 4.3.1)
 labelled        2.12.0   2023-06-21 [1] CRAN (R 4.3.1)
 lattice       * 0.21-8   2023-04-05 [2] CRAN (R 4.3.1)
 lifecycle       1.0.3    2022-10-07 [1] CRAN (R 4.3.1)
 lubridate     * 1.9.3    2023-09-27 [1] CRAN (R 4.3.1)
 magrittr        2.0.3    2022-03-30 [1] CRAN (R 4.3.1)
 markdown        1.11     2023-10-19 [1] CRAN (R 4.3.1)
 MASS            7.3-60   2023-05-04 [2] CRAN (R 4.3.1)
 Matrix        * 1.6-1.1  2023-09-18 [1] CRAN (R 4.3.1)
 mgcv            1.8-42   2023-03-02 [2] CRAN (R 4.3.1)
 mosaic        * 1.8.4.2  2022-09-20 [1] CRAN (R 4.3.1)
 mosaicCore      0.9.2.1  2022-09-22 [1] CRAN (R 4.3.1)
 mosaicData    * 0.20.3   2022-09-01 [1] CRAN (R 4.3.1)
 munsell         0.5.0    2018-06-12 [1] CRAN (R 4.3.1)
 naniar        * 1.0.0    2023-02-02 [1] CRAN (R 4.3.1)
 nlme            3.1-162  2023-01-31 [2] CRAN (R 4.3.1)
 norm            1.0-11.1 2023-06-18 [1] CRAN (R 4.3.1)
 patchwork     * 1.1.3    2023-08-14 [1] CRAN (R 4.3.1)
 pillar          1.9.0    2023-03-22 [1] CRAN (R 4.3.1)
 pkgconfig       2.0.3    2019-09-22 [1] CRAN (R 4.3.1)
 plyr            1.8.9    2023-10-02 [1] CRAN (R 4.3.1)
 polyclip        1.10-6   2023-09-27 [1] CRAN (R 4.3.1)
 purrr         * 1.0.2    2023-08-10 [1] CRAN (R 4.3.1)
 R6              2.5.1    2021-08-19 [1] CRAN (R 4.3.1)
 RColorBrewer    1.1-3    2022-04-03 [1] CRAN (R 4.3.0)
 Rcpp            1.0.11   2023-07-06 [1] CRAN (R 4.3.1)
 readr         * 2.1.4    2023-02-10 [1] CRAN (R 4.3.1)
 reshape         0.8.9    2022-04-12 [1] CRAN (R 4.3.1)
 rlang           1.1.1    2023-04-28 [1] CRAN (R 4.3.1)
 rmarkdown       2.25     2023-09-18 [1] CRAN (R 4.3.1)
 rpart           4.1.19   2022-10-21 [2] CRAN (R 4.3.1)
 rstudioapi      0.15.0   2023-07-07 [1] CRAN (R 4.3.1)
 sass            0.4.7    2023-07-15 [1] CRAN (R 4.3.1)
 scales          1.2.1    2022-08-20 [1] CRAN (R 4.3.1)
 sessioninfo   * 1.2.2    2021-12-06 [1] CRAN (R 4.3.1)
 simputation   * 0.2.8    2022-06-16 [1] CRAN (R 4.3.1)
 snakecase       0.11.1   2023-08-27 [1] CRAN (R 4.3.1)
 stringi         1.7.12   2023-01-11 [1] CRAN (R 4.3.0)
 stringr       * 1.5.0    2022-12-02 [1] CRAN (R 4.3.1)
 tibble        * 3.2.1    2023-03-20 [1] CRAN (R 4.3.1)
 tidyr         * 1.3.0    2023-01-24 [1] CRAN (R 4.3.1)
 tidyselect      1.2.0    2022-10-10 [1] CRAN (R 4.3.1)
 tidyverse     * 2.0.0    2023-02-22 [1] CRAN (R 4.3.1)
 timechange      0.2.0    2023-01-11 [1] CRAN (R 4.3.1)
 tweenr          2.0.2    2022-09-06 [1] CRAN (R 4.3.1)
 tzdb            0.4.0    2023-05-12 [1] CRAN (R 4.3.1)
 utf8            1.2.3    2023-01-31 [1] CRAN (R 4.3.1)
 vctrs           0.6.4    2023-10-12 [1] CRAN (R 4.3.1)
 visdat          0.6.0    2023-02-02 [1] CRAN (R 4.3.1)
 withr           2.5.1    2023-09-26 [1] CRAN (R 4.3.1)
 xfun            0.40     2023-08-09 [1] CRAN (R 4.3.1)
 xml2            1.3.5    2023-07-06 [1] CRAN (R 4.3.1)
 yaml            2.3.7    2023-01-23 [1] CRAN (R 4.3.0)

 [1] C:/Users/thoma/AppData/Local/R/win-library/4.3
 [2] C:/Program Files/R/R-4.3.1/library

──────────────────────────────────────────────────────────────────────────────

Appendix

Obtaining Residual Plots with plot(), instead

modA Residuals via plot()

par(mfrow = c(2,2)); plot(modA); par(mfrow = c(1,1))

modB Residuals via plot()

par(mfrow = c(2,2)); plot(modB); par(mfrow = c(1,1))

modC Residuals via plot()

par(mfrow = c(2,2)); plot(modC); par(mfrow = c(1,1))

modD Residuals via plot()

par(mfrow = c(2,2)); plot(modD); par(mfrow = c(1,1))